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For a massive I/O array of size n, we want to compute the first N local moments, for some constant N. Our 
simpler algorithms partition the array into consecutive ranges called bins, and apply not only to local-moment 
queries, but also to algebraic queries. With N buffers of size v /n, time complexity drops to 0( v /n). A more 
sophisticated approach uses hierarchical buffering and has a logarithmic time complexity (0(b \og b n)), when 
using N hierarchical buffers of size n/b. Using Overlapped Bin Buffering, we show that only one buffer is 
needed, as with wavelet-based algorithms, but using much less storage. 
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1. INTRODUCTION 

In a data-driven world where permanent storage devices have an ever growing capacity, I/O 
access becomes a bottleneck when read/write performance does not increase as quickly as 
capacity [Vitter 2002]. As Gray put it: "we're able to store more than we can access" [Pat- 
terson 2003]. At the same time, users of applications in OLAP [Codd et al. 1993] or in 
visualization [Silva et al. 2002] now expect online processing of their data sets. A strategy 
to solve this problem is to use a relatively small internal-memory buffer to precompute 
some of the expected queries. For example, it might be reasonable to use a memory buffer 
of one megabyte ( 1 MB) 1 per terabyte (TB) of external-memory data. 

If X is a random variable with probability distribution /, we define the moment of order 
N about a value c as the expectation of (X — c) N or E((X — c) N ) = f(x — c) N f{x)dx. Cor- 
respondingly, given an array a and some constant c, — c) N ai is a moment of order N. 
Given an arbitrarily large array a, we can precompute the moments with ease, however we 
are interested in local moments: given a range of indices p,...,q and a constant c all pro- 



'Throughout, we use the familiar units of kB, MB, GB and TB to measure storage in groups of 2 10 , 2 20 , 2 30 and 
2 40 bytes. Our units thus coincide with the new IEC units KiB, MiB, GiB and TiB [IEC 1999]. 
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vided dynamically, we wish to compute Lf=p(' — c ) Na i online. Frequency moments [Alon 
et al. 1996], where is the number of occurences of item i, are outside the scope of 
this paper. 

Local moments are used widely, from pattern recognition and image processing to mul- 
tidimensional databases (OLAP). Among other things, they have been proposed as a re- 
placement for maximum likelihood methods in statistics, to cope with the magnitude of 
the new data sets [Scott and Sagae 1997]. 

As an example application, given the number of items in a store for every possible price, 
the sum (moment 0) over a range of prices would return how many items are available, the 
first moment would return the total dollar value for items in the specified price range, and 
the second moment would allow us to compute the standard deviation and variance of the 
price. 

As another example, imagine a moving sensor measuring the density of some metallic 
compound in the ground. A geophysicist could then ask for the average density of this 
compound over some terrain or for the "center of mass" in some region. Indeed, if the 
array a contains the density over some one-dimensional strip of terrain, then the average 
density over some region given by indices p, . . . , q is given by a,/ (q — p) whereas the 
center of mass is given by (£f = _ ia{) /Xf = „ a*. 

As yet another example, consider the local regression problem [Cleveland and Loader 
1995]. Suppose that given a large segment p, . . . ,q of a very large array y,-, the user wants 
to view the best polynomial of order N—l fitting the data: this problem occurs when try- 
ing to segment time series [Lemire 2007], for example. Given a polynomial Y!k=n a k^y 
the residual energy is given by YH= p (yi ~ Y*k=0 a ki k ) 2 - Setting the derivative with re- 
spect to a/ to zero for / = 0, . . . ,N — 1, we get a system of N equations in N unknowns, 
tk=o a kLl= p i k+l = Ltpyi'' where l = 0,...,N-l. Note that the right-hand-sides of the 
equations are N local moments whereas on the left-hand-side, we have the N x N matrix 
Akj — Y!i= p i k+l - F° r an Y k + l, we can compute the sum Y!j= p ' k+l m constant time irre- 
spective of q - p. To see this, consider that because Y?i= p ' k+l = Lf=o ' k+ ' ~ T*f=o i k+ ' ! > it is 
sufficient to be able to compute expressions of the form £f =0 z' ft+ ' quickly. However, there 
is a formula for these summations. 

Several fast techniques have been proposed to compute local moments [Li and Shen 
1992; Zhou and Kornerup 1995] but this paper is concerned with precomputing auxiliary 
information to speed up the computation of local moments. 



2. NOTATION 

We use C-like indexing for arrays: an array of length n is indexed from to n — 1 as in 
do,.. -,a n -\. Indices are always integers, so that the notation € [k,l] for an index means 
that i € {k, I}. Given a set R, the set of all finite arrays of values in R is denoted by ^L R . 
The restriction of a function / to domain D is noted f\ D . Some common functions over 
real-valued arrays ( _# R ) or "range-query functions" include COUNT and SUM, where 
COUNT(ao, • • • = n and SUM(ao, • • • = L"=o a i- Other possible queries in- 

clude MAX (which returns the maximum value in a range) or MAX_N (which returns the 
N largest values found in a range). The query moment of order N is formally defined as 
Y?i= p {i ^ p) N<2 i- Note that computing query moments of order N, for various N, allows the 
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calculation of any local moment. Indeed, notice that 

i(i-c) N ai = £ ( N \p-cf- k t{i-pU (1) 

i=p k=0\ K / i=p 

for any constant c, by the expansion of ((/ — p) + (p — c) ) N using the Binomial Theorem. 
3. RELATED WORK 

Using a buffer, we can compute linear range queries in 0(1) time. For example, given 
an array A = {ao,a\,. . . ,a n -\}, we can use the Prefix Sum method [Ho et al. 1996] and 

precompute the array PS(A) = {flo,flo + fli,«o + a\ + «2, • • ■ ,«(H \-a n -{\. By a mere 

subtraction, we can then compute any range sum of the form at + a^+i H ha/ since 

a k -\ Yen = (a H ha,) - (a H r-a^_i). 

However, if a few data points are updated, then all of PS(A) may need to be recomputed: 
updates require 0(n) time. A more robust approach is the Relative Prefix Sum (RPS) 
method [Geffner et al. 1999] which buffers the prefix sums only locally over blocks of size 
b. For example, 

RPS(A) = {ao,ao + ai,ao + ai +02,03,03 + 04, •• •} 

when b = 3. Clearly, RPS(A) can be updated in time 0(b). To still achieve 0(1) query 
time, we use an overlay buffer 

oo + oi +02,00 H ho 5 , . . . ,ooH ho n _! 

that can be updated in time 0(n/b). While RPS requires &(n) in storage, updates can be 
done in O(^fn) time by choosing b = ^fn. We can improve the update performance of RPS 
using the Pyramidal Prefix Sum (PyRPS) method [Lemire 2002]. Its query complexity 
is <9(p) with update cost (9(pn 1//p ), where p = 2,3,... Thus, PyRPS obtains constant- 
time queries but with faster updates than RPS. Furthermore, PyRPS supports queries and 
updates in logarithmic time by choosing p = log(n). 

Of course, these techniques extend to other range queries. In the context of orthogonal 
range queries for multidimensional databases, similar results are even possible for range- 
maximum queries [Poon 2003]. 

The PS, RPS, and PyRPS methods (and similar alternatives for other queries) have a 
common inconvenience: each type of range query is buffered separately: Y,k a k, Y^k^ a k, 
Yikk*~ a k, ■ ■ ■ An equivalent view is of a single tuple-valued buffer, rather than several real- 
valued buffers. The ProPolyne framework [Schmidt and Shahabi 2002] showed how to 
avoid tuple-valued buffers by using wavelets. ProPolyne has both logarithmic queries and 
updates at the cost of a buffer of size 0(n) for computing local moments (Polynomial Range 
Queries). ProPolyne simultaneously buffers all local moments up to a given degree. So, 
ProPolyne reduces storage when compared with prefix-sum methods such as PyRPS, but at 
the expense of constant-time queries. See Table I for a comparison of various alternatives 
to buffer local moments. 

In a wavelet framework such as ProPolyne, we can keep only the most significant 
wavelet coefficients to reduce storage and increase performance, while obtaining reason- 
ably accurate results [Chakrabarti et al. 2001; Jahangiri et al. 2005; Vitter et al. 1998]. It is 
also possible to process the queries incrementally so that approximate results are available 
sooner. 

ACM Transactions on Computational Logic, Vol. V, No. N, February 2008. 



4 



D. Lemire and O. Kaser 



Table I. Comparison of local moment algorithms with corresponding storage requirements and 
complexity for large n where we buffer the first N moments. Note that p = 2,3 . . . is a parameter 
that can be chosen to be large. The storage requirement is the number of components needed to 
buffer computations. Ola is a form of Bin Buffering specific to local moments. 



Algorithm 


Query 


Update 


Storage 


One-Scale Ola 


0(Nn/b+N 2 b) 


0(N) 


n/b+l 


Hierarchical Ola 


0{N 2 blog b n) 


0(N 2 log b n) 


n/b+l 


Bin Buffering 


0(Nn/b + Nb) 


0(N) 


Nn/b 


Hierarchical Bin Buffering 


0{Nb\og h n) 


0(N\og h n) 


Nn/b 


ProPoIyne 


0{N 2 \og 2 n) 


0(N 2 \og 2 n) 


n 


Prefix Sums 


0(N) 


0(Nn) 


Nn 


Relative Prefix 


0{N) 


0{N^fi) 


Nn 


PyRPS 


0(Np) 


O(Np^n) 


Nn 


PyRPS (log) 


O(Mogn) 


O(Nlogn) 


Nn 



4. CONTRIBUTION AND ORGANIZATION 

For storage, a reduction from n to ^Jn can be quite significant: if n = 2 40 (1 TB), then 
\fn = 2 20 (1 MB), so we argue that simple buffering schemes might often prove more 
practical than PyRPS or ProPoIyne, especially because a small buffer can be generally 
constructed faster. As Ho et al. [Ho et al. 1996] observed (for the Prefix Sum Method), 
good performance can be obtained with a small buffer, provided we retain access to the 
original array. 

The paper is organized as follows. We first consider how bin buffers can be used to 
speed up many range queries over dynamic external arrays (section 5). For each bin or 
"range of indices," Bin Buffering [Moerkotte 1998] associates a single buffer component. 
However, its scalability is limited because very large buffers do not improve performance 
and can even worsen it. In section 6, we present the analysis of a variant, Hierarchical Bin 
Buffering, which supports logarithmic queries and updates even for modest buffers. These 
results are novel, but have been alluded to in the concluding section of a paper [Moerkotte 
1998]. In section 7, we present a novel buffering framework: Overlapped Bin Buffering. In 
Overlapped Bin Buffering, each buffer component depends not only on one but a range of 
bins. In this context, we present Lagrange Interpolation (section 8) as a tool to compute a 
useful buffer and establish an explicit link between buffering and interpolation. The result 
is an Ola buffer and we show it can be hierarchical as well. Section 9 presents precisely 
stated algorithms and experimental results. We then proceed on some concluding remarks. 
We elaborate on the differences between Ola and BIN BUFFERING (section 10) and show 
how Ola can support efficient progressive approximate queries (section 1 1). 

5. FAST ALGEBRAIC RANGE QUERIES USING PRECOMPUTATION OVER BINS 

Let R be an algebraic structure such as R or K m . A range-query function Q : Jl R — > R is 
distributive [Gray et al. 1996] if there is a function F : R R — > R such that for all < k < 
rt-1, 

Q{ao,...,a k ,a k+ \,...,a n -\)=F{Q{ao,...,a k ),Q{a k+ i,...,a n -i)). 

Examples of distributive range-query functions include COUNT, SUM, and MAX. In 
this paper, we shall only consider range queries where the computational cost is indepen- 
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dent of the values being aggregated. By convention, F(Q(ao, . . . ,a k )) = Q(ao, ■ ■ -,a k ); i.e., 
the function F is the identity when applied to a single value. 
We have 

Q(a , ...,a„_i) = F(Q(a ,...,a n - 1 )) 

= F(Q(ao,...,a n -2),Q{a„-i)) 
= F(F(Q(a Q , . . . ,a„- 2 )) ,Q(a n -i)) 
= F(F(F(Q(ao,...,a n - 3 ),Q(a n - 2 )), 

and so we can compute F(Q(ao, . . . ,a n -i)) recursively, using n— 1 pairwise aggregations. 
Hence, a distributive range query over n terms has complexity 0(n). 

Distributive range-query functions can be combined. For example, define the joint query 
function (Q\ , Q 2 ) as the tuple-valued query function 

(Gi,C2)(ao,---,a«-i) = {Q\{ao,...,a n -\), Q.2(ao,..., a n -\)). 

We can verify that (Qi,Q 2 ) is distributive if Q\ and Q2 are distributive. 

In this paper, a real-valued range-query function Q : — > M is algebraic if there is 
an intermediate tuple-valued distributive range-query function G : Jl R — > R™ from which 
g can be computed. For example, given the tuple (COUNT, SUM), one can compute 
AVERAGE by a mere ratio, so AVERAGE is an example of an algebraic query function. 
In other words, if Q is an algebraic function then there must exist G and F : Jl R — > R m for 
some fixed integer m such that 

G(ao,...,a k ,a k+ \,...,a n -\) = F(G(ao,...,a k ),G(a k+ i,...,a n -i)). 

We have that algebraic queries can be computed in time 0(n) because we can interpret 
them as distributive queries over tuples. All real-valued distributive range-query functions 
are algebraic, but examples of non-distributive algebraic functions include MAX_N, AV- 
ERAGE, CENTER_OF_M AS S , S TAND ARD_DE VI ATION , and local moments. As an 
example, if Q is AVERAGE, then we can choose G to compute the tuples (COUNT, SUM) 
and F can be the component-wise sum. Similarly, if Q is the local moment of order N, 
then G should compute the tuple made of (COUNT, SUM, . . ., moment of order N). Com- 
bining two N' h -ovdev query moments (;' — p) N ai and £f =r (j — r) N ai into an aggregate 
Y!j = p{i ~ p) N ai can be done by Eq. (1), which requires that we know the lower-order mo- 
ments. In terms of N, a straightforward implementation of G runs in O(A0 time. Indeed, 
we can compute G(oq, . . . ,a k ) as F(G(ao,...,a k -i),G(a k )) but G(a k ) = (l,a k ,0, . . . ,0). 
Operation F itself requires 0(N) time. Hence, G(ao, . . . ,a k ) can be computed in 0(Nn) 
time. 

In what follows, we will specify algebraic functions as triples (Q,G,F). Whenever we 
use the (Q, G,F) notation, it is understood that 

Q : JZ M -» R, and G : JZ R -» R m , and F : ^ R "' -» R m 

and often Q will not be used explicitly because computing G is enough. We will assume 
that the size of the tuples, m, is small: m < 16. 
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Range Query 

Fig. 1 . An algebraic range query supported by Bin Buffering, as in Eq. (2). 

Given an integer b that divides n and given an algebraic function (Q, G,F), we can buffer 
queries by precomputing b/n components 

G(ao,...,ab-i), 
G{ab,...,ci2b-\),---, 
G(a n _b, . . . ,a„-i) 

denoted Bq, . . . ,B n i b -\. This buffer can be updated in time 0(b) if an array component is 
changed. Using this precomputed array range queries can be computed in time 0(n/b + b) 
because of the formula 

G(a k , ■ ■ . ,a/) = F(G(ai c , • ,B[i/b}-iiG( a bli/b} > ■ ■ -i a i))- ( 2 ) 

See also Figure 1. By choosing b = ^/n, we get updates and queries in time 0(^/n) with 
a buffer of size yfn. In different terms, this algorithm was presented by Moerkotte [Mo- 
erkotte 1998]. 

When buffering local moments of order N, G computes N + 1 -tuples so that the size of 
the buffer is (N+l) xn/b. This can be reduced to N x n/b if all bins are of a fixed size b, 
since we need not store COUNT. 

An algebraic range-query function (Q,G,F) is linear if the corresponding intermediate 
query G satisfies 

G(a + ado, . . . , a„- 1 + ad n - 1 ) = G(a , . . . ,a„_i ) + <xG(d , ...,d n -i) 

for all arrays a,d, and constants a. SUM, AVERAGE and local moments are linear func- 
tions; MAX is not linear. Linear queries over bins of size b can be computed using the 
formula G(ao, ■ ■ - = aoG(e^) + . . . + a b -iG(e ( - b -^), where an array of size b 

satisfying = if i ^ j and ef 1 = 1 . For our purposes, we define an update by the loca- 
tion of the change, k, and by how much the value changed, A = a' k — a^. We see that the 
update complexity for buffered linear range queries is reduced to constant time since 

G(a , . . . ,a k -i,a' k ,a k +i, . . .,a b -i) - G(a , ...,a k ,.. .,a b -i) = (a k - a k )G(e (k ' ) ) 

= G(e {k) )A 

and G(e«) can be precomputed or computed in constant time. 
Hence, we see that: 

(1) All algebraic queries can be bin buffered, including MAX, AVERAGE, and local mo- 
ments. 

(2) For linear queries, the buffer can be updated quickly. 
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LEMMA 5.1. For an algebraic range-query function Q, given an array of size n, Bin 
Buffering uses a buffer ofn/b tuples computed in time 0(n) and updated in time 0(b) to 
support queries in time 0(n/b + b). Choosing b — ^Jn minimizes the query complexity to 
0(\/n). If Q is linear then updates take constant time. 

LEMMA 5.2. Consider local moments of degree N, where N is fixed and small. Given 
an array of size n, Bin Buffering uses a buffer of size N x n/b computed in time 0(n) 
and updated in constant time to support queries in time 0(n/b + b). Choosing b = ^/n 
minimizes the query complexity to 0(\fn). 

If N is not considered fixed, buffer computation is 0(Nn), update time is 0(N) and 
query time is 0(Nn /b + bN) . 

A possible drawback of the Bin Buffering algorithm is that the query complexity cannot 
be reduced by using a larger buffer. For example, in going from a buffer of size n/2 to a 
buffer of size yfn, the algorithm's complexity goes down from 0(n) to O(yfn). We will 
show in the next section how we can use larger buffers in a hierarchical setting to increase 
the speed. 

6. HIERARCHICAL BIN BUFFERING 

In the previous section, we showed we could precompute algebraic range queries over bins 
of size b to support O (n/b + b) -time, queries. We can scale this up using a pyramidal or 
hierarchical approach [Lemire 2002] . 

For a fixed b, the n/b term dominates the 0(n/b + b) complexity. In Eq. (2) the n/b term 
comes from the buffer aggregation. So, we started from an aggregation over n terms and 
reduced it to an aggregation over n/b terms; clearly we can further reduce the aggregation 
over n/b terms to an aggregation over n/b 2 terms by the same technique (see Figure 2). In 
other words, we can buffer the buffer. Hence, considering the buffer of size n/b as a source 
array, we can buffer it using n/b 2 components to support queries in time 0(n/b 2 +b) over 
the buffer instead of 0(n/b). Thus, the end result is to have queries in time 0(n/b 2 + 
2b) with a buffer of size at most n/b + n/b 2 . Repeating this argument \og b n times, we 
get queries in time 0(b\og h n) using Y.k=i....,\o% b n n /b k < n/(b— 1) storage. If the query 
function is invertible, as defined in the next subsection, then we can use in-place storage 
for higher-scale buffers. This reduces the internal memory usage to n/b. The update 
complexity is 0(b\og h n) in general and C^log^n) for linear queries. 

6.1 In-place Storage for Invertible Query Functions 

A given algebraic function (Q 7 G 7 F), is invertible if O(l) time is sufficient to solve forx 
in z = F(x,y), where x,y,z <G K m (m is assumed small). Linear queries are invertible, and 
being invertible is a useful property: it means that the storage used by x can be used to 
store z — storing x,y or z,y is almost equivalent. This lets us "buffer a buffer" in place, as 
the next proposition shows. 

PROPOSITION 6. 1 . If(Q, G,F) is an invertible algebraic query function, then second- 
scale Bin Buffer components B'q = F(Bq, . . . ,Bb-\) and B' h = F(B bl ■ ■ ■ ,B2b-i) 1 ■ ■ ■ can be 
stored in-place at positions 0,£>,2£>, ... in the buffer (overwriting values Bo 1 Bb,B2b, ■ ■ ■) 
without increasing query time complexity. 

Proof. Assume we use in-place storage for the second-scale Bin Buffers B' ,B[,. . ., 
overwriting Bo,Bb,.... We must evaluate expressions of the form F(Bj c , . . . which can 
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Range Query 



Fig. 2. An algebraic range query supported by Hierarchical Bin Buffering. We essen- 
tially repeat Figure 1 over the buffer itself. In the example given, by aggregating buffer 
components, we replace 6 first-scale buffer components by 2 second-scale components. 

be done using the second-scale Bin Buffer, according to the formula 

F(B k ,...,Bi)=F(F(B k ,...,B^/^ b _ l ) 7 B'^ k / b ^...,B' b (y l / hi _ l yF(B b yi/ hi ,...,B l )). 

The only place where an overwritten value appears is in the last term: B b ^/ b j has been 
replaced by the value of B'^^y However, the query is invertible, so B b yi/ b ^ can be re- 
covered in constant time. Thus the algorithm using two-scale buffers is still going to be 
0(n/b 2 + 2b), even though in-place storage has been used. □ 

We can repeat this process for each buffer scale, each time incurring only a fixed cost 
for recovering an overwritten value. The total additional cost is 0(log b n), but this is dom- 
inated by the cost of the query itself (0(b\og b n)). In other words, in-place storage almost 
comes for free. 

As an example of Hierarchical Bin Buffering with in-place storage, consider the array 
flo, • • • ,fl80 in = 81) and some invertible algebraic query function (Q, G,F). The one-scale 
Bin Buffering algorithm with b = 3 simply precomputes 

Bo = G(a Q ,ai,a 2 ),Bi = G(ai,a 4 ,a 5 ), . . . ,B 26 = G(an,a 19 ,a%o) 

so that if we want Q(a\,... ,079), we still have to compute 

F(G(a u a 2 ),B u . . . ,B 25 ,G(a 7& ,a 19 )) 

which is the aggregation of 27 terms using F. We can aggregate the buffer itself in a second 
buffer, in this case, by precomputing 

B' = F(B ,B U B 2 ),B' 3 = F(B 3 ,B 4 ,B 5 ),. . . ,B 24 = F(B 6 ,B 7 ,B & ) 

and storing them in-place, so that Bq,B^, . . . ,624 are replaced by the newly computed 
B' ,B' 3 ,.. .,B' 24 . Then, to compute Q (a 1, ... ,079), it suffices to compute 

F(G(ai,a 2 ),B\,B 2 ,B' 3 ,B' 67 ...,B' 2l ,B 2 4,B 2 s,G(a 1 s,a 1 g)), 

the aggregation of only 13 terms. The query cost is halved without using any additional 
memory. 

As the next lemma explains, the hierarchical case presented above is simply a general- 
ization of the case in the previous section, but where large buffers can be used to answer 
queries in logarithmic time. Recall that local moments are linear and invertible whereas 
MAX queries are neither. 
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LEMMA 6.2. For an algebraic range-query function Q, given an array of size n, Hier- 
archical Bin Buffering uses a buffer of ^ tuples computed in time 0(n) and updated in 
time 0(blog b n) to support queries in logarithmic time 0{b\og b n). If the query is invert- 
ible, then a smaller memory buffer of size n/b can be used; for linear queries updates can 
be done in time 0(log b n). 

For non-invertible queries such as MAX, that is, the worst case scenario, this last lemma 
implies that a buffer of size nj (b— 1) can support queries in time 0(blog h n) with updates 
in time 0(b\og b n). For invertible and linear queries such as SUM, the storage is only 
n/b with updates in time 0(\og b n). Choosing b = 2 minimizes the query complexity 
((3(log 2 «)) while maximizing the storage requirement at n/2, whereas choosing b = *Jn 
reduces to the non-hierarchical (one-scale) case with a query complexity of 0(y/n) and a 
storage requirement of \fn. 

Note that G operates on tuples, and thus a buffer of n/b elements occupies mn/b space, 
offsetting the economical nature of the Hierarchical Bin Buffering algorithm. Another 
result is that the G operation becomes more expensive for higher-order moments; in the 
analysis leading up to Lemma 6.2, we implicitly assumed the cost of G was constant. 
However, if the analysis considers that operation costs increase with N, we have that buffer 
construction is in 0(Nn), updates are in 0(N\og b n) and queries are in 0(Nb\og b n). As 
we shall see in the next section, for some types of queries such as local moments, it is 
possible to avoid using tuples in the buffer. 

7. OVERLAPPED BIN BUFFERING 

In the previous sections, we described Bin Buffering and Hierarchical Bin Buffering as 
it applies to all algebraic queries. Such Bin Buffering is characterized by the facts that 
buffer components, Bq = G(ao, ■ ■ . ,<z&-i), B\ = G(a b , . . . ,a2fc-i), • • •, are over disjoint bins 
and are aggregated using G itself. In this section, we will consider only weighted sums as 
aggregate operators, and we will define buffer components that depend on several bins at 
once. This can also be interpreted as having overlapping bins. Our motivation is to buffer 
local moments using a single real-valued buffer, and we begin by considering one-scale 
buffering in subsections 7.1-8.1, but in subsection 8.2 we will extend our results to the 
hierarchical case. 

7.1 General Case 

Consider an array a of size n indexed as oq, . . . ,a n -\. By convention, a,j = for j ^ [0,n). 
Assuming that n is divisible by b, we group the terms in bins of size b: first oq, . . .,a&_i, 
then ab,-- ■ 1 a2b-\ and so on. We have n/b bins and we want to compute n/b+1 buffer 
components Bq,... ,B n / b to speed up some range-query functions such as SUM. However, 
we drop the requirement that each buffer component correspond to one and only one bin, 
but rather, we allow buffer components to depend on several bins, hence the term "Over- 
lapped Bin Buffering." 

For an array arj, • • -,a n -\ and given integers M,M' > 0, consider buffer components of 
the form 

M'b-l 

Bk= c j a j+kb 
j=-Mb 

where the coefficients Cj are to be determined but are zero outside their range, i.e., Cj = 

ACM Transactions on Computational Logic, Vol. V, No. N, February 2008. 



10 • D. Lemire and O. Kaser 




Fig. 3. Overlapped Bin Buffering with M = 1,M' = 1. Buffered components depend on overlapping areas. 

if j ^ [—Mb, M'b). We could generalize this framework so that M + M' remains a constant 
but that M and M' depend on the bins: we could accommodate the end and the beginning of 
the array so that j + kb is always in [0, n), but this makes the formulas and algorithms more 
pedantic. In essence, the B k are weighted sums over a range of M + M' bins. We can inter- 
pret the coefficients Cj as weights used to compute the buffer component B k , where j gives 
the offset in the original array with respect to kb. An example is shown in Figure 3. Note 
that in the special case where M — 0,M' = 1, we have the usual Bin Buffering approach, 
which maps each bin to exactly one buffer component and where Cj = 1 for j G [0,b). As 
we shall see, increasing M and M' allows for additional degrees of freedom. 

We can compute by replacing / by / such that / agrees with / on [p,q] but 

is zero otherwise, and then compute Y!i=o f(') a i- Thus, it is enough to have a fast algorithm 
to compute Y!!=o f(') a i f° r an arbitrary /. The next proposition presents a formula that is 
instrumental in achieving a fast algorithm. 

PROPOSITION 7.1. Given an array ao,. .. ,a„-i and integers M,M' > 0, and given the 
buffer components B k = Y!f=~Mb c j a i+kb, we have 



n-l 



n/b n-l 



£/(*fr)B t +£8(i)fli 



!=0 



k=0 i=0 



where 



8(0 



u\+m 

f(kb)ci-kb- 



k=n\-M>+i 



Proof. By the definition of B^, we have 



n/b n/b M'b-l 



Y,f(kb)B k =Y< E f(kb)c jaj+kh . 



k=Q k=0 j=-Mb 



Define ;' = j + kb so that 




E E f(kb)cja j+kb = £ £ f(kb)ci- kb ai- 



k=0 j=-Mb k=0 i=kb-Mb 



Because c,_^ is zero whenever ^ [kb— Mb, kb+M'b), we can replace 
in the above equation to get, after permuting the sums, 



• kb+M'b-l 
<i=kb-Mb 



byi; 



• n-l 
'i=0 
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However, note that c,_^ is zero whenever fcfr ^ (i—M'b,i+Mb] ork$. (|_jj —M', |_£J +M]. 



Hence, we can replace Y<k=o by L,,^, ; m , r t0 § et 



«/6 «-i / LiJ+M \ 

E /(**)«* =E ^ f{kb)ci-kb\au 

k=0 i=0 \fc=L^J-M'+l / 

which can be subtracted from Y!t=o /(0 a i t0 P rove the result. □ 
The key idea is that to support fast computations, we want 

8(0 =/(/)- E f(kb)c l -kb = (3) 
k=lh\-M'+i 

for most integers ;', as this implies that we can compute almost all of the range query using 
only the buffer: i.e., from the previous proposition when 8(0 = 0, we have 

„-l n/b-l 

E /('>.■= E fi kb ) B k- 

1=0 k=0 

It seems remarkable that the precomputed Bk values are suitable for use with many func- 
tions, possibly including functions not envisioned when the buffer was initially constructed. 

From 8(i) = we will arrive at Lagrange interpolation in section 8, since we are mostly 
interested in the case where / is locally a polynomial. Moreover, because we want the 
ability to store the buffer component B^ at position kb, we also require that 8(kb) = 0. This 
will ensure that the value aub is never needed in Eq. (3). 

Proposition 7.2. From the definition 

i'b\+ M 

8(0-/(0- E f(kb)ci- kb , 

k=yi\-M'+i 

we have that 5(kb) = for all integers k if 

1 if/ = 
otherwise 



cib - 



Proof. Assume that c/^ is zero whenever I ^ 0, then 

l+M 

s(ib) = f(ib)- E W)cv-k)b 

k=l-M'+l 

= f(lb)-f(lb)c Q 

and the result follows. □ 

Consider the case where M = 0,M' = 1, then B^ = L/=d c j a j+kb- Suppose we want to 
buffer range sums, then / will be 1 except at the endpoints. So, from 8(0 = 0, we see that 
we want 

UJ 

E C i-kb = 1 

*=UJ 
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or c i_[ijfe = 1; that is, c is always 1 within the range of its indices. Thus, we retrieve the 

formula = Y!jZo a j+kb as the unique solution to buffer range sums when M = 0,M' = 
1. As we shall see, for larger overlaps the solution is no longer unique and the problem 
becomes more interesting. 

7.2 An Example: Sum and First Moments (M = l,M' = 1) 

With M = 0,M' = 1, we can buffer SUM queries. Using M = l,M' = 1, we will buffer the 
first two local moments: local range sums (X, a,) and local first moments (X ( za;)- While 
we will support a wider range of queries, the buffer size remains unchanged. However, the 
complexity of the queries does go up when N increases. 
With M = l,M' = 1, 8(/) = implies 

LeJ + i 

8(0 =/(')- E f(kb)c t -a = 0. (4) 

We recognize this problem as the linear Lagrange interpolation of f(i) using values at 
f{\i/b\b) and f([i/b\b + b). The next proposition gives a solution to these equations. 

PROPOSITION 7.3. Given integers n and b, such that b divides n, the equation f{i) — 
Y!1=q 1 f(kb) c i-kb = holds for all linear functions, f(x) = ax + b, if c { = 1 — ^ when 
ie[-b,b). 

Proof. Setting f(x) = 1 and f(x) = x in Eq. (4) yields two equations 

UJ+i UJ+i 
1= E Cl '-**' ' = H kbci_ kh , 

which are true when c; = 1 — y . The general result (f(x) =ax + b) follows by linearity. □ 

We can verify that 8(kb) = 0, using Proposition 7.2: when / ^ {0, — 1} then lb ^ [— b,b), 
hence cu, = 0. Otherwise, en, = 1 — i.e., 1 when I — and when / = — 1. 

8. OVERLAPPED BIN BUFFERING FOR LOCAL MOMENTS: OLA BUFFERS 

Overlapped Bin Buffering as described in the previous section can be used to buffer local 
moments as in subsection 7.2. In the special case of Overlapped Bin Buffering where the 
buffers are computed using Lagrange interpolation, we call the resulting data structure an 
Ola buffer. 

Lagrange interpolation is a common technique discussed in standard Numerical Analy- 
sis references [Rao 2002, section 5.5]. In essence, given a function / and M + M' samples 
of the function /(mi),/(/W2), ■ ■ ■ ,f( m M+M')> then 

(1) we solve for the unique polynomial p of degree M + M 1 — 1 such that p(m\) = 
f(mi),p(m 2 ) =f(m 2 ),...,p(m M+M /) = f{m M+M ,); 

(2) we evaluate the polynomial p at x and return this as the interpolated value. 

It should be evident that if / is itself a polynomial of degree at most M + M 1 - 1, then the 
interpolation error will be 0; that is, p(x) = f(x). We say that Lagrange interpolation of 
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order M + M' — 1 reproduces polynomials of degree M + M' — 1 . Also, Lagrange interpo- 
lation is optimal, in the sense that it uses the smallest possible number of samples while 
reproducing polynomials of a given degree. 

Given M + M' samples mi,.. ■ ,m M+M i of a function /, the Lagrange formula for the 
interpolated value at x is given by 

m M+M' ( m M+M' _ \ 

rv>-JL n J=£ )m 

k=mi \m=mi;m^=k / 

In Proposition 7.1, we introduced the function 8(z) given by 

li\+ M 

m- £ ci- a f(kb) 

k=ll\-M'+\ 

which can also be written as 

+M 

m- E c r . kb m+i-])b) 

k=-M'+l 

where r = i- On the other hand, as a direct consequence of the Lagrange formula, 

choosing mi = b([^\ — M' + 1), . . .,m M+M i = b([^\ + M) withx = i, we have 

*=-M'+l \m=-M'+l,m#<t KD mD / 

Now, by Lagrange's formula, if / is a polynomial of degree M + M' — 1 and if 

r-mb 

Cr ~ kh ~ m =-M>\l^k kb - mbl 

then 8(/) = 0. When M = l,M' = 1, Lagrange interpolation becomes equivalent to linear 
splines and is trivial to compute: if k = 0, c r = ^ else if k = 1, c r _^ = |; we conclude 

thatc r _ tt = l-^. 

For M + M' > 2, we can simply apply the formula 

it r-mfr 
X\m=-M'+\^k r / b - m 



(M-ifc)!(Jfc + M'-l)! 

and precompute the coefficients once for b possible values of r = 0, . . . b — 1 and M+M' — 1 
possible values of k = —M'+ I,... ,M. This gives a total of [M + M' — \)b coefficients. 

The next lemma applies these results and says that Overlapped Bin Buffering efficiently 
buffers the first M+M' moments when using coefficients derived from Lagrange interpola- 
tion. Hierarchical Ola with b = sjn is One-Scale Ola, thus we need not make an explicit 
distinction between the two. 
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LEMMA 8.1. Overlapped Bin Buffering with overlap parameters M,M' > allows 
8(z) = for polynomials of degree M + M' — 1, if the coefficients c are chosen to be the 
Lagrange coefficients of degree M + M' — 1. 

8.1 Local Moments Using One-Scale Ola 

We have already seen that each buffer value B^ in the Ola buffer is a sum over several bins, 
Bk = Y.f = h rMb c j a j+kb where the Cj are determined by Lagrange interpolation or directly as 
in subsection 7.2. By Lemma 8.1, if / is any polynomial of degree M + M' — 1, then 
local moment queries of the form £ ( /(z')<z; can be answered from the Ola buffer alone. 
However, consider function f{x) = x over [2, 10] and zero elsewhere. This function is used 
when computing the first-order query moment Yd=2 ' a i- (^ ee subsection 7.1.) Our approach 
must be refined to handle / and other useful functions. 

An alternate viewpoint to the approach emphasizes how Lagrange interpolation provides 
an approximation h to function /. Knowing how h differs from / allows us to compensate 
for /'s not being polynomial or (in section 11) allows us to bound the error from imperfect 
compensation. The details of this viewpoint follow. 

By the proof of proposition 7.1, if we define 

h (>) = E f( kb ) c i-kb 
k=ll\-M>+l 

then Y,h{i) a i = Y.f{kb)Bk- It is interesting to note that given some function /, we can 
compute h using a Lagrange polynomial for each bin as follows. 

(1) Pick the bin, and suppose it comprises the cells in [kb, (k + l)b). 

(2) Obtain M + M' data points by sampling / at { (k - M + l)b, . . . , kb, . . . , (k + M' - 
\)b,{k + M')b). 

(3) There is a unique polynomial g^ of degree M + M' — 1 going through all those data 
points. Notice that g^ is computed bin-wise. 

Whenever / is polynomial of degree at most M+M' ~ 1 over [(k—M+l)b, (k+M')b\, then 
f = g over this same interval because the polynomial is unique. Moreover, the Lagrange 
coefficients satisfy Proposition 7.2 and thus h(kb) = f(kb) for all k. Then, we can define 
a function by piecing together the bin-wise polynomials and because the Lagrange inter- 
polant is unique, we have A| [jt&,(jfc+i)i») = since both sides of the equation are Lagrange 
interpolants of the same points. That is, there is a unique linear interpolation algorithm of 
degree M + M' —I over M + M' data points, exact for polynomials of degree M + M' — 1. 
For our range queries, the query function / is pieced together from a polynomial (inside 
the range) and the polynomial g(i) = (outside the range). Hence, f(i) — h(i) will be zero 
except for two ranges of width (M + M')b, located around each of the range query's end 
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points. Moreover, £/z(;')a; = Y<f{kb)Bk implies by Eq. (3) that 

n/b 

i i k=0 

= E/(*>< - E^*')^ 

i 

=E(/(0 -mw 

i 

Therefore X) ( -5(z')a,- can t> e computed in time 0(N 2 b), accessing M + M' bins from our 
external array. 

Hence, Eq. (3) tells us how to answer the original query, Y*if(O a i> faster than the 
£l(n) time required without precomputation. The query can be rewritten as the sum of 

Y!k=of(kb)Bk an d Li'8(') a i' (computed in time 0(n/b) and 0(N 2 b), respectively). There- 
fore, we obtain a net reduction of the complexity from n to n/b + N 2 b. 

We note that the N 2 b factor could be improved to Nb, if we precompute h for the b 
possible positions where an endpoint could fall within a bin, and for the N basic functions 
1 , i , i 2 , . . . , i N ~ 1 we expect used with local moments. Each of these Nb values would have 
&(Nb) tabulated entries, leading to a storage complexity of &(N 2 b 2 ), possibly too high for 
the small time savings except for tiny values of b. Yet small values of b would arise in the 
hierarchical setting that we shall next explore. 

8.2 Local Moments Using Hierarchical Ola 

We can further decrease the complexity by a hierarchical approach. We reduced the com- 
plexity from 0(Nn) to 0(Nn/b + N 2 b). After the first transform, the cost is dominated 

by the computation of Y!l=of(^)^k- By me same method, we can reduce the complexity 
further to n/b 2 +N 2 2b. Because in-place storage is possible, this comes with no extra stor- 
age burden. Applying the method log fc n times reduces it down to 0(N 2 b\og b n). The net 
result is similar to Bin Buffering, except that we are able to buffer the first two moments 
simultaneously, using a single buffer. 

9. USING OVERLAPPED BIN BUFFERING FOR FAST LOCAL MOMENTS (OLA): AL- 
GORITHMS AND EXPERIMENTAL RESULTS 

This section puts the ideas of the previous section into practice, presenting more details of 
Ola and presenting an experimental analysis of Ola as an example of Hierarchical Bin 
Buffering. There are three fundamental operations on Ola buffers: building them initially, 
using them for fast queries, and finally, updating them when the underlying data changes. 
Each fundamental operation will, in turn, be described and its complexity analyzed. How- 
ever, our complexity analysis is typical and ignores system-specific factors including the 
relative costs of various mathematical operators and the effects of the memory-access pat- 
terns on a computer's memory hierarchy. To show that the algorithm can be efficiently 
implemented, we we coded Ola in C++. The performance of our implementation com- 
pletes the discussion of each fundamental operation. 

To enable replication of our results, we next provide some details of our implementation 
and the test environment. Our experiments were conducted with N being even, and we 
chose M = M' = j ; these constraints were imposed by our implementation. Our test 
platform was a Pentium 3 Xeon server with 2 GB RAM running the Linux operating system 
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(kernel 2.4), and the software was compiled using the GNU compiler (GCC 3.2 with - 
02). For these experiments, we simulated arrays of any size by "virtual arrays" defined by 
a,- = sin(z'): the function sin is chosen arbitrarily and the intent is that the access time to 
any one array element is a fixed cost (a calculation rather than a memory or disk access). 
Results are thus less dependent on current disk characteristics than would be otherwise 
possible. See section 9.4 for disk-based experiments. 

For our experiments, array indices were always 64 bits, whereas stored values are 32- 
bit floating-point values. Unless otherwise specified, n w 2 30 , giving us about 4 GB of 
virtual-array data; by "size", we refer to n, the number of array elements. As well, the 
buffer always fit within main memory and no paging was observed. We considered several 
different values of b: 32, 128, 1024, 2 15 and 2 20 . The first three values imply hierarchical 
Ola and can be justified by the ratio of main memory to external memory on current 
machines. The last two values imply one-scale Ola and fit the introduction's scenario that 
y/n would be an appropriate internal buffer size (b = 2 15 ), or fit a scenario where the user 
wants whatever gains can be obtained from a tiny buffer. We chose the value b = 128 as 
the "typical" value when one was needed. 

For N, we experimented with values 2, 4, 8 and 16, with 4 deemed the typical value. 
Value 2 does not enable all the local moments that are likely to be used in practice, whereas 
we could not imagine any scenario where 16 th or higher moments would be useful. 



9.1 Computing the Ola Buffer 

The construction of the Ola Buffer is possible using one pass over the external array, as 
illustrated in Algorithm 1 . 

Each buffer component is over a number of bins that depends linearly on the number of 
buffered moments. Similarly, as the size of the input array increases, we expect a linear 
increase in the construction time. The reason is that the number of buffered components 
increases linearly with n: although the number of buffer scales, p, increases with n, we 
have Y%=i n /b k € ©(«). Each component has a computation cost that depends only on N, 
leading to an overall construction time of 0(Nn). 

Of course, the storage required is inversely proportional to b: n/b. However, we do not 
expect the construction time to vary significantly with b as long as the buffer is internal 
(n/b is small). Indeed, when b grows, then the cost of computing each buffer element 
grows linearly and is proportional to Nb. On the other hand, the number of buffer com- 
ponents decreases with l/b. In total, the cost is roughly independent of b. If n/b grows 
substantially, then we might expect a slight time increase due to poorer memory perfor- 
mance on the large array. However, for all our experiments, the buffer size remained much 
smaller than system RAM. Experimental data to substantiate this is shown in Figures 4-5. 
The scale of Figure 4 magnifies what was less than a 7% difference between construction 
times, and we saw a small increase (5%) in construction time when b increased, which 
was not as anticipated. However, the point stands that buffer construction was not heavily 
affected by the choice of b. 
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Algorithm 1 Ola Buffer Computation 

constants: bin size b, even number of buffered moments N and Lagrange coefficients c of 
degree N — 1, p is the largest integer such that n/b$ >N. 

function computeBuffer(a): 

INPUT: an array a 

OUTPUT: an array B (Ola buffer) 

B <— onestep(a,0) 

for s = 0,1,..., 0-1 do 

B' <— onestep(Z?,s) 

for k e {0, 1, . . . ,size(B') - 1} do 
B w +\ <- B' k 

end for 
end for 



function onestep(a,s): 
INPUT: an array a 
INPUT: a scale parameter s 
OUTPUT: an array B 



Allocate 



1 



size(a) 

for i = 0, 1, ... , [size(a)/b s \ + 1 do 
if i is a multiple of b then 



components in zero-filled array B 



B [i/b\ ^ B \i/b\ + a ib s 



else 

for m = — 

B [i/b\ +m 

end for 
end if 
end for 



fdo 

C—mb+i mod b a ib s 



+m 



9.2 Fast Local Moments Using the Ola Buffer 

The algorithm for fast queries follows from Proposition 7.1: recall that we choose M' = 
j ,M = j. As a first step, we have to compute 8(;)fl,' where 

8(0-/(0- I' f(kb) Ci . kb . 

Then for scales s= 1 , . . . , B, we add Zi 5^(0^-1 where 

L-J+- 

8W(i) = /(^)- 

*=L|J-f+i 

Finally, at scale s = P, we add to the previous computations Yll=o f(kb^ +l )B kh f,-i . 

The key to an efficient implementation is to know when 8(0 — or 8^ s \i) — will be zero 
given a function /, so that we only sum over a number of terms proportional to Nb at each 
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Fig. 4. Time in seconds for the construction of an OLA buffer, with N = 4 and varying values of b. 
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Fig. 5. Time in seconds for the construction of an Ola buffer, with b = 2 15 (one-scale Ola) and varying values 
of N. Times were almost identical for hierarchical OLA using b = 128. Curve t(N) = 268/V + 450 (shown) fits 
the points well. 



scale. In other words, we need to do a lazy evaluation. Suppose that / is a polynomial 
of degree at most 1 for Af even over the interval \p,q] and the overlap parameters are 
M = j,M' = j , then 8^ may be nonzero only over intervals [p',p"] and [q',q"], where 
P' = (L^rJ - %)b, P" = (L^rJ + N 2)b, 4 = (L^rJ - \)b and q" = (L^J + %)b. The 
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complete pseudocode is given in Algorithm 2. 



Algorithm 2 Moment Computation using the Ola Buffer 

constants: bin size b, even number of buffered moments N and Lagrange coefficients c of 
degree Af — 1 , p is the largest integer such that n/b$ >N. 

function query(a, B, f, p,q): 
INPUT: an array a 
INPUT: an Ola buffer B 

INPUT: a function / which is a polynomial of degree at most N — 1 over [p, q] and zero 
elsewhere 

OUTPUT: returns S = L^r] /(«'M 

for ;' e candidates(s/ze(a),0,/?,g) do 

S^S+ (f(i) - I^f_ f +1 f{kb) Ci - kb ) ai 
end for 

for s = 1 , . . . , P do 

for ;' e cmdid^tes(size(B)/b s ^ 1 ,s,p,q) do 

S^S+ (f{W) -L l k i[]]_ n +l f{k¥+ l )ci- k ^j B ibs -\ 

end for 
end for 

s^s+r^-'nicb^B^ 

function candidates(ii'ze,s,/7,^): 
OUTPUT: the union of 



and 




Queries are 0(b$) where p = log fe n, so when b increases the algorithm's running time 
increases in proportion to j^p. Therefore, because the buffer size is given by n/b, the 
algorithm becomes slower as the size of the buffer is reduced. Experimentally, this was 
measured by randomly selecting, with replacement, 2000 of the (j) different non-empty 
ranges. More precisely, we choose two uniformly distributed random numbers a and b and 
pick the interval [min(a,b),max(a,b)). We set N to a "typical value" of 4, and timed the 
2000 sums and 2000 first moments for various b values 2 . 

Results are shown in Figure 6; we also plotted the function t{b) = £>/ (5000 Info). The 
measured running time appears to grow no faster than t (b). 



2 When N 2 b was large, we tested only [ 80 ^°°° J cases, to keep test times reasonable. 
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Fig. 6. Average time (in seconds) for the computation of many randomly selected range sums for N = 4 and 
various values of b. Function t(b) = b/ (5000 In b) is also shown. 



Note that the time for a range query is affected somewhat by the length of the range I, 
in that the number of buffer elements B, accessed will be approximately I jb for one-scale 
Ola. (For hierarchical Ola the relationship between the number of buffers accessed and 
the range length is much more complex.) As well, for hierarchical Ola, the number of 
hierarchical levels processed will also depend on the precise positioning of the range's 
endpoints. To see these effects, we plotted the time 3 versus the range size. Results for 
One-Scale Ola (see Figure 7) are as expected: the time was dominated by the (unvarying) 
work done around the range's endpoints. There was a small additional contribution coming 
from the number of buffer values accessed, which showed up as a slight upward slope on 
the cluster. First moments and sums behaved similarly. 



3 Timed on a slightly faster Pentium 4 machine with 512 MB RAM, running a Linux 2.4 kernel that had been 
patched to supply high-resolution timings and hardware performance counts via PAPI[Browne et al. 2000]. 
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Fig. 7. Average time (in seconds) versus range length for the computation of 1,525 randomly selected first 
moments (x) and 1,525 sums (+) for N = 4 and b = 2 15 . (One-Scale OLA.) 



CO 
T3 

o 
o 

CD 

CD 
E 



0.004 
0.0038 
0.0036 
0.0034 
0.0032 

0.003 



XX X X 



x + iii. 



X x x+xx + xx 

Hi i mi mi iipwm 1 1 mil i ii i i mi ii ri ii f h ; ' 

le^xeesafioox x«x< x x x x< x xx 



2e8 4e8 6e8 8e8 1e9 
b 



Fig. 8. Time in seconds versus range length for 2,000 randomly selected first moments (x) and 2,000 sums (+) 
with N = 4 and b = 128 (4-Scale OLA). 



The situation is more complex for hierarchical Ola (see Figure 8), where the positioning 
of the range determines the number of buffer values and where the positioning of each 
endpoint determines how many external array elements are accessed and determines how 
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♦ 

2 4 6 8 10 12 14 16 
N 

Fig. 9. Average time per query (seconds) versus N for randomly selected range sums with b = 2 15 (one-scale 
OLA). (The points fit t(N) = WIN 2 + .08JV well.) 

many hierarchical scales need to be considered for the region surrounding each endpoint. 
Since the running times are smaller, it is perhaps not surprising that the data appears noisier. 

We can show that query times for hierarchical Ola grow quadratically as the number 
of moments buffered is increased. For one-scale Ola, queries have two main sources 
of cost: first, the cost from computing Y,kf(k)Bk, where / is piecewise or an 1 st 
degree polynomial, which we evaluate at a cost of &(N) per point within the range of the 
query. The expected range of our queries is long, so this cost is significant. The second 
cost of our queries comes from a &(N 2 ) calculation done around the range's endpoints. 
Therefore, for our small values of N, the total cost includes both a large N 2 as well as a 
large linear component. From a theoretical point of view, however, the growth is &(N 2 ) 
and is dominated by the endpoint computations. (See Figures 9 and 10). Hence, it might 
be detrimental to buffer many more moments than we require. However, the number of 
moments has no effect on the space complexity, unlike the bin size, b. Therefore, for large 
enough arrays, even if we buffer many moments, the Ola approach will still be several 
order of magnitude faster than unbuffered queries. 

The Ola approach was not sensitive to the query: range sums or first moments were 
measured to take almost exactly (within 1 %) the same time. Therefore, these results are 
not plotted. 

Based on the theoretical analysis, Ola can permit huge query speedups, given ex- 
treme values for parameters such as the relative speeds of internal versus external memory, 
amount of memory allocated to the buffer, and so forth. However, we need good speedups 
for "reasonable" parameters. From our experiments, it is evident that buffered arrays were 
considerably faster, and random queries that averaged about 71.6 s without buffering could 
be answered in 0.390 s or 0.00605 s when the 4 GB dataset was buffered with 128 kB 
or 32 MB (using N=4). This corresponds to respective speedups of 184 and 11800. The 
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Fig. 10. Average time per query versus N for 2000 randomly selected range sums with b = 32 (3- or 4-Scale 
OLA). The running time fits t(N) = (N 2 + 1 1.7iV)/l 1700 well. 



construction time of approximately 1500 s means a total construction+query break-even is 
achieved after about 21 queries. 

9.3 Updating the Ola Buffer 

To update the buffer, we can consider how the buffer was originally constructed: it was 
computed from the data source and then, in a hierarchical manner, buffers were computed 
from the previous buffer and stored in place. Recall, for instance, that in the Ola buffer, 
values of the second-scale buffer are stored in indices b,2b,...,(b— l)b, (b + l)b, (b + 
2)b, (2b — \)b, (2b + l)b,... whereas the values of the third-scale buffer are stored at 
b 2 ,2b 2 , . . . , (b — l)b 2 ,(b+ l)b 2 7 ... and so forth. We define the cells at scale s as those 
having an index divisible by with the expository convention that "scale 0" refers to 
entries in the external array. Our updates propagate changes from smaller scales to larger 
scales. (See Algorithm 3.) 

To understand the update algorithm acting upon this hierarchical buffer with in-place 
storage, it is helpful to consider the "is computed from" relation between cells, which 
forms a directed acyclic graph (dag). Showing each cell at every scale to which it belongs, 
coloring (black) the largest scale for each cell, and focusing only on the part of the dag 
that needs to be updated, we obtain Figure 11. The black vertices at scale s (for s > 1) in 
the dag correspond to indices i such that L^rJ i s not divisible by b, that is, cells that do 
not belong to scale s + 1 . The uncolored vertices belong to scale s + 1 and in-place storage 
means that an update affects the black node beneath it (except for the largest scale since 
the algorithm terminates). The portion of the dag that is reachable from the changed cell 
(at scale 0) is called the update dag. 

From this, we see that the update cost is linear with the height of the update dag. To 
prove it, we first observe that we can bound, independently of the height of the dag (given 
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by p ~ log b n), the number of cells per scale that need to be updated. 

PROPOSITION 9.1. By Algorithm 3, given an update of one cell in the external array, 
updates are propagated from one scale to another over at most 2N cells. In other words, 
the update dag (as in Figure 11) has at most 2N nodes at each level. 

Proof. Let lr s \ be the difference in indices between the last modified buffer cell and 
the first modified one (ordering is by indices) at the end of step s in Algorithm 3. In other 
words, li s ~j is the "range" of the modified cells at step s. By convention, = 0. From 
the algorithm, we see that Iia < l( s -i) + Nb s . Hence, we have that lt s \ < 2Nb s so that 
li s )/b s = 2N. Hence, each time we move from one scale to another, at most 2N cell values 
are modified. □ 

This proposition tells us that the middle for loop in Algorithm 3 has at most 2N steps; 
since we do 0(N) operations within each, the update complexity is 0(N 2 $). 

Correctness of the algorithm is straightforward and relies on the fact that index i is 
processed only at its largest scale (see the first if statement). At this time, all updates to i 
will have been completed. 

The effect of N on computational cost as measured experimentally is given by Figure 13. 
The observed relationship appears linear, but the three collinear points are misleading; for 
N = 2 and N = 4, we had (3 = 4. However, for N = 8 and N = 16, we had p = 3. As 
well, since only 0(N) hashmap entries are created (and then updated 0(N) times each), 
if hashmap-entry creation is expensive, then the running times will contain a large linear 
component in N. 
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Cells in the Ola buffer (scales > 0) 
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Fig. 11. Update dag for Ola buffer with M = M' = 2 (N = M + M' = 4) and b = 2 (see Algorithm 3). Each 
column with an entry at Scale 1 corresponds to a buffer cell, whereas entries at Scale are in the external array. 



If the original array was not dense, that is, if most components were zero, then it can 
be more efficient to construct the buffer starting with a zero buffer and then adding each 
non-zero value as an update. Because the cost of each update is 0(A^ 2 p), if there are d(n) 
non-zero values in the original array, then the complexity of building the buffer through 
updates is 0(d(n)N 2 $). This is asymptotically better than Algorithm 1 whenever d(n) £ 
o(n/N$). Experimentally, for = 4 and b — 128, we made 200k random updates in about 
12 seconds. Thus, even if data items in a sparse set were added in an unordered manner, 
it would be faster to build the buffer through updates if it had about 24 million or fewer 
elements, or a density of ^f^r = 2A% or less - S 

ince update time decreases rapidly with 
b whereas the time for Algorithm 1 is almost independent of b, once b > 2 15 incremental 
construction is a reasonable alternative to Algorithm 1 for any data set. 
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Algorithm 3 Updating the Ola Buffer. 

constants: bin size b, even number of buffered moments N and Lagrange coefficients c of 
degree N — 1, p is the largest integer such that n/b$ >N. 

function update (B, j, A): 
INPUT: an index j in the original array a 
INPUT: an Ola buffer B over the array a 
INPUT: the change A in the value of a/ 
RETURN: modifies B 

deltas is a (hash) map { assume for unassigned values } 

deltas j <— A 

for s — 0, . . . , p do 

Let keys(deltas) be the set of keys for the hash table deltas at this point 
{ Invariant: keys(delta) contains only indices at scale s] 
for ;' e keys(deltas) do 

if L^J is not divisible by b then 

{Process ;' because s is its largest scale} 
8 <— deltasi 

for m = - j + 1 , . . . , j do 

deltas ([-^ n \+m)¥+ l ^~ deltas '(i^j+^+i +c_ fom+(L ^j mod&) 8 
end for 

if ;' is divisible by b then 

{ Only (possibly) j is not a multiple of b} 

Bi/b <- Bj/ b + deltasi 
end if 

remove key i from deltas 
end if 
end for 
end for 

{ Cells belonging to scales p + 1 and above are still in delta and need to be added to the 
buffer (see uncolored nodes at the last level of Figure 1 1).} 
for ;' e keys(deltas) do 
Bi/b <- Bi/b + deltasi 
end for 



A key point is that updates to the buffer get progressively less expensive as b goes up 
and the size of the buffer goes down. Figures 12 and 13, as well as Table II provide 
experimental evidence of these claims. 
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Fig. 12. Average time versus b for 200,000 random updates with N = 4 
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Fig. 13. Average time versus N for 200,000 random updates with b = 128 (3- or 4-Scale Ola). 



9.4 External Memory 

The use of "virtual arrays" in the previous section allowed us to abstract away from the 
specific details of current memory-system hardware. However, one might abuse such sim- 
plified models, thus incorrectly predicting good practical performance for algorithms that 
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Table II. Linear relationship ob- 
served between P and update time. 
N = 4 and average time was over 
200,000 random updates. 



b 


P 


time (ps) 


time/p 


32 


5 


68.5 


13.7 


128 


4 


58.1 


14.5 


1024 


2 


25.6 


12.8 


2 15 


1 


14.9 


14.9 


2 20 


1 


12.0 


12.0 



make irregular and non-local accesses to disk. Nevertheless, our algorithms for buffer con- 
struction and queries tend to have good locality. For instance, with queries in One-Scale 
Ola, two consecutive groups of indices (around either endpoint of the queried range) are 
accessed. Experiments to support our claims were derived using memory-mapped files. 
Unfortunately, due to our experimental setup (mainly a 32-bit address space), we were 
forced to choose a smaller value of n w 2 28 elements, or about 1 GB of data. These ex- 
periments were performed on a computer with 512 MB of RAM, and since much I/O was 
anticipated (and observed), we took wall-clock times while the system ran in single-user 
mode. 

Repeating the experiments in which we timed the construction of an Ola buffer with 
N = 4 and varying b, we obtained the results shown in Figure 14. We note that the dis- 
crepancy for b = 128 does not seem to be an error: it was repeatable. Except for this one 
value of b, we see that changes in b affected construction time by less than 10%. The large 
discrepancy at b = 128 apparently came from the operating system and system libraries 4 . 

By conducting the experiments of subsections 9.1 to 9.3 with virtual arrays, we avoided 
many secondary system-level effects that might have obscured our results. But it is useful 
to compare the timings obtained with (realistic) memory-mapped array versus those from 
our synthetic virtual arrays. For reference, with virtual arrays and n w 2 28 , a construction 
time of about 250 s was obtained with N = 4, for all values of b. We see that the virtual 
array lead to a construction time that was approximately three times longer. 

We also timed random range-sum queries using our memory-mapped array (see Fig- 
ure 15). For comparison, similar random queries were also computed directly from the 
external array, without using the Ola buffer at all. Despite the good locality of the obvi- 
ous algorithm for this task, with N = 4- and b = 128, Ola answered the query less than 
.014 seconds, versus 6.3 seconds when no buffer was used: a speedup of over 400. For 
reference, a virtual array lead to query times that were approximately 11% slower than 
with the memory-mapped implementation for b = 2 15 , N = 4 but about 75% faster for 
£=128, TV = 4. 

Comparing Figure 15 to Figure 6, we observe that with the memory-mapped array, the 



4 Using the same hardware, but with the Linux kernel upgraded to version 2.6.20, glibc to version 2.5, and the 
GNU C++ compiler to version 4.1.2, we obtained different results: the cases b = 32, b = 128 and b = 2 20 were 
similar. (The median time of 25 runs for b = 128 was no more than 5% larger than the medians of the other two 
cases.) The cases of b = 1024 and b = 32768 were similar, their medians being slightly less than 20% faster 
than b = 128. Repeating tests for N = 16, all cases except b = 2 20 were similar to one another, whereas b = 2 20 
was approximately 50% slower than the others. The effects of varying N and b are described in Section 9.1; 
we conjecture that some combinations of N and b produce page-access streams that are easier for the operating 
system to handle efficiently. Further investigation is outside the scope of this paper. 
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query time is not as sensitive to differences in b, when b is small. Presumably this is due 
to blocking on disks: even when b is small, at least an entire virtual memory page or disk 
block needs to be dedicated to the area around each endpoint of the query range. 
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Fig. 14. Time to construct an Ola buffer (N = 4) from a memory-mapped disk file, versus b. 
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Fig. 15. Average time to answer a random range-sum query from an OLA buffer (N = 4) versus b. A memory- 
mapped file with n ss 2 28 4-byte floating-point numbers was used. 
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10. OLA VERSUS BIN BUFFERING 

Assume that we are given the task of buffering N moments with a fixed amount of internal 
memory K over a very large array of size n. Recall that given an array of size n and a 
buffer size b, Ola will use a buffer of size n/b+l. Hence, Ola would lead to bins of size 
b = n/(K—\)fan/K whereas B IN BUFFERING would use larger bins of size b' =Nn/ K. 

Assume we can read bins of size b from external memory with a fixed cost of units 
of time, and we can access internal memory cells with a cost of 1 unit of time. For sim- 
plicity, we also assume that Nb < Eb, which seems likely given the small values of N 
anticipated. One-scale Ola and Bin Buffering have then exactly the same complexity, 
that is, queries have worst-case complexity 0(NE/,+K). The hierarchical versions also 
have similar complexity to one another. 

However, not all queries have the same cost: the two algorithms are not equivalent. BIN 
BUFFERING will support ( K ^) ranges without any access to the external array: all range 
queries from bin edges to bin edges can be answered entirely from the buffer. For instance, 
^1=2k/n 1 a > can ^ e answered by summing 8 buffer elements. 

However, for some applications, we might be interested in how well we can approximate 
the query without access to the external array. This is especially important in applications 
such as visualization, where a very fast initial approximation is valuable. 

To explain why Ola is more competitive in providing good approximations using only 
the buffer, take the case where N = 2 and assume that the values in the external array are 
uniformly bounded in absolute value by K. That is, |a 7 | < K for all j. Recall that we assume 
that the internal buffer has a size of K. Then consider range sums such as Y!j=k a j- The 
largest error made by Bin Buffering is bounded by 2b'K = 4xn/K since we miss at most 
one bin at each end (whose total value is at most b'x). We can reduce this bound to 2Kn/K 
by choosing to add bins whenever they are more than half occupied. On the other hand, the 
largest error that Ola can make is bounded by 2k(6/2) = Kn/K. Indeed, the worst error 
is reached when range edges match with bin edges. In that case, we wrongly take a full 
bin at each end. Actually, due to the linear decrease in c, values, the more distant values 
in these bins are weighted lightly; this leads to the b/2K bound on each bin. Hence, Ola 
is twice as accurate for estimating range sums from the buffer when N = 2. Irrespective 
of N, the error for Ola is more than bounded by 2bK = 2Kn/K. However, the error made 
by Bin Buffering can be as bad as b'K = Nkti/K, even taking into account the possible 
improvement one gets by including bins that are more than half used. In other words, as 
N grows, the worst-case error made by Bin Buffering grows linearly, unlike Ola. This 
result applies to hierarchical versions of these algorithms as well. 

Hence, one might want to look at the case N = 8. This is not unreasonable in a visual- 
ization setting where the user can set the degree of the polynomials to be fitted. In such a 
case, Bin Buffering has very large bins (8 times larger than Ola) which might be un- 
desirable: the approximation power of Bin Buffering for range sums is at least 4 times 
lower than Ola because of the much larger bins. 

11. APPROXIMATE QUERIES USING OLA 

We have seen that Ola can have a competitive advantage when we are interested in getting 
approximate queries out of the memory buffer. Indeed, Ola supports a wide range of 
query types using a single memory buffer and relatively small bins. However, as with 
wavelet-based techniques [Vitter et al. 1998; Chakrabarti et al. 2001; Schmidt and Shahabi 
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2002], Ola can support incrementally better estimates. With wavelet-based methods, one 
gets approximations by selecting the most significant wavelet coefficients. However, this 
approach calls for storing all coefficients in order to quickly answer queries within a user- 
specified error bound. This would be unacceptable for many applications: the wavelet 
buffer is as large as the external array itself. Incrementally better Ola approximations can 
be computed by first using the internal buffer, and then adding bins one by one. Indeed, 
for a given range query, the Ola algorithms involve many bins at both ends of the range. 
However, only the first few bins have a significant contribution. 

Recall subsection 8.1, in which we showed that Y.f(kb)Bk was given by £ft(z')a ( - where 
h is a Lagrange interpolation of the range query function /. Only when the function / goes 
from a polynomial to is there a difference between the target range function / and the 
range function h estimated by bin-wise Lagrange interpolation. However, the error made 
by Lagrange interpolation also diminishes as we move away from the bin containing the 
edge of the range. In many cases, it might be sufficient to take into account only 1 or 3 bins 
near the edge (at each endpoint of the range). For example, consider N = 4, b = 1024 with 
f(x) — 1 for x > xq and otherwise. A numerical evaluation shows that using only one bin 
at each endpoint instead of the required 3 will take care of 86% of the error when range 
edges are in the middle of a bin. The result is more significant for larger N, for example, 
for N = 16, we need 15 bins at each endpoint for a complete evaluation; however, if we 
use only 5 centered, we take care of 97% of the error. (See Figure 16.) In short, Ola can 
provide wavelet-like progressive evaluation of the queries simply by querying fewer bins 
in the external array. 

We can analyze more mathematically the relationship between the number of bins used 
at each end of the range and the error. First note that bin-wise Lagrange interpolation is 
linear: if we interpolate the sequence {0,0,0, 1,0} and the sequence {0,0,0,0,2}, then the 
sum of the two interpolants is just the interpolation of the sequence {0,0,0, 1,2}. Hence, 
it is sufficient to consider only one non-zero sample value at any given time. We proceed 
to show that the contribution of a sample value f(kb) to bin-wise Lagrange interpolation 
decays quickly past one bin. Let N = M' +M be fixed, and consider the interpolation of the 
sequence xq = 1, x^ — for all i ^ 0. We can then consider the interpolant h in the k th bin 
defined by the interval [bk,bk+b). The following polynomial (refer to Eq. (5)) describes 
h: 

X\lM-u^(-kb + bi) 

where x e [bk, bk + b) . The formula is only valid for — M' < k < M— 1; elsewhere h 
is identically zero. Clearly the denominator will increase sharply in absolute value as k 
grows. We show that the numerator is non increasing in k. Setting y — x — kb £ [0,b], 
we have Ytf=M- h¥ k( x ~ kb + bi ^ = + However, I^.^fy + ib) = 

Yl M ' (y+ib) 

' =l (y+kb) — an ^ because y e [0,b], y + kb e [kb,kb + b] and so, the numerator goes down 
in amplitude as l/k. On the other hand, the denominator in absolute value, (b(—k + M — 
1) ) • • • (b) (b) ■ ■ ■ {b{k + A?) ) = b M+M ' (M— 1 —k)\(k + M')\. Setting X = k + M', we have 
b M+M> (Af _ i — Jt)!(Jt + Af )! = b M+M ' (M + M' — l — \)\M = b M+M ' (N— 1 — \)\M which 
has a rate of increase of starting at (N + 2)/(N - 2) and rising with k. Hence, the amplitude 
of the polynomial decreases faster than exponentially as k increases. 
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Fig. 16. Given a step function / going from to 1, we show that the Lagrange interpolation h is quite close to / 
as we move away from the discontinuity. In this figure, N = 16 and b = 8. 

It is difficult to compare the progressive approximation we get using this approach with 
related wavelet-based ones. Wavelet-based algorithms do not use the original array as 
a data source when answering queries and thus, they have much larger storage require- 
ments. For large-scale applications, approximate queries are required for wavelet-based 
algorithms because storing all the coefficients is unthinkable whereas Ola has progressive 
approximate queries as an added option. 

1 2. CONCLUSION AND FUTURE WORK 

This paper has considered bin-buffering algorithms and showed that using a hierarchical 
approach, highly scalable algebraic queries were possible even with a small buffer. Using 
overlapped bins, we have shown that we could buffer several local moments simultaneously 
and use much less storage than wavelet-based approaches while still supporting progressive 
queries and very scalable queries and updates. 

In short, we showed that N local moments could be buffered using only a single real- 
valued buffer: using bins of a fixed size irrespective of N. Other types of range queries 
could also be grouped and buffered efficiently together [Deligiannakis and Roussopoulos 
2003]. By a direct product [Lemire 2002], Hierarchical Bin Buffering and therefore the 
Ola approach can be generalized to the multidimensional case. 

Some implementation issues were not addressed. For example, many forms of buffering 
using finite-accuracy floating-point numbers are susceptible to significant numerical errors. 

Finally, the source code used for the production of this paper is freely available [Lemire 
and Kaser 2007]. 
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